Ashkin-Teller type perceptron models are introduced. Their maximal capacity per number of 
couplings is calculated within a first-step replica-symmetry-breaking Gardner approach. The results 
are compared with extensive numerical simulations using several algorithms. 
I. INTRODUCTION 



The perceptron, which was first analyzed with statistical
mechanics techniques in the seminal paper of Gardner, is by now
a well-known and standard model in theoretical studies and
practical applications in connection with learning and
generalization. A number of extensions of the perceptron model
have been formulated, including many-state and graded-response
perceptrons. Here we present some new extensions allowing for
so-called coloured or Ashkin-Teller type neurons, i.e., different
types of binary neurons at each site, possibly having different
functions.

The idea of looking at such a model is based upon our recent
work on Ashkin-Teller recurrent neural networks [12,13]. There
we showed that for this model with two types of binary neurons
interacting through a four-neuron term and equipped with a Hebb
learning rule, both the thermodynamic and dynamic properties
suggest that such a model can be more efficient than a sum of
two Hopfield models. For example, the quality of pattern
retrieval is enhanced through a larger overlap at higher
temperatures and the maximal capacity is increased. For more
details and an underlying neurobiological motivation for the
introduction of different types of neurons we refer to that
work.

In the light of these results an interesting question is whether
such a coloured perceptron can still be more efficient than the
standard perceptron. In other words, can it have a larger
maximal capacity than the standard perceptron, for which
$\alpha_c = 2$ is known (for random uncorrelated patterns)? It
has been suggested that this number is characteristic for all
binary networks, independent of the multiplicity of the neuron
interactions. Here the capacity is defined as the thermodynamic
limit of the ratio of the total number of bits per (input)
neuron to be stored to the total number of couplings per
(output) neuron. We remark that "input" and "output" refer
specifically to the perceptron case.



In the sequel the maximal capacity of coloured perceptron models
is studied using the Gardner approach. First-step
replica-symmetry-breaking effects are evaluated and the analytic
results are compared with extensive numerical simulations using
various learning algorithms.

The rest of this paper is organized as follows. In Sec. II 
we introduce two Ashkin-Teller type perceptron models. 
Section III contains the replica theory and determines the 
maximal capacity by calculating the available volume in 
the space of couplings both in the replica-symmetric (Sec. 
IIIA) and the first-step replica-symmetry-breaking ap- 
proximation (Sec. IIIB). Section IV describes the results 
of numerical simulations with algorithms obtained by 
generalizing various algorithms for simple perceptrons. 
In Sec. V we present our conclusions. Finally, two appen- 
dices contain some technical details of the derivations. 



II. THE MODEL 



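Let us first formulate the coloured perceptron models. We
consider $p$ input patterns
$\{\zeta_i^\mu\} = \{\xi_i^\mu, \eta_i^\mu\}$, $i = 1, \ldots, N$,
consisting of two different types of patterns
$\xi^\mu = \{\xi_i^\mu\}$ and $\eta^\mu = \{\eta_i^\mu\}$, and a
corresponding set of outputs
$\zeta_0^\mu = \{\xi_0^\mu, \eta_0^\mu\}$, $\mu = 1, \ldots, p$,
which are determined by

$$ \xi_0^\mu = \mathrm{sign}(h_1^\mu + \eta_0^\mu h_3^\mu) \qquad (1) $$

$$ \eta_0^\mu = \mathrm{sign}(h_2^\mu + \xi_0^\mu h_3^\mu) \qquad (2) $$

$$ \xi_0^\mu \eta_0^\mu = \mathrm{sign}(\eta_0^\mu h_1^\mu + \xi_0^\mu h_2^\mu) \qquad (3) $$

where $h_r^\mu$ ($r = 1,2,3$) are the local fields acting on the
patterns $\xi$, $\eta$ and their product $\xi\eta$ respectively,

$$ h_1^\mu = \frac{1}{\sqrt{N}} \sum_i J_i^{(1)} \xi_i^\mu, \qquad h_2^\mu = \frac{1}{\sqrt{N}} \sum_i J_i^{(2)} \eta_i^\mu, \qquad (4) $$

$$ h_3^\mu = \frac{1}{\sqrt{N}} \sum_i J_i^{(3)} \xi_i^\mu \eta_i^\mu . \qquad (5) $$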
Both types of input patterns and their corresponding outputs
are supposed to be independent identically distributed random
variables (IIDRV) taking the values +1 or -1 with probability
1/2. The set of three equations (1)-(3) defines a mapping of the
inputs $\zeta^\mu$ onto the corresponding outputs $\zeta_0^\mu$.
We call it model I. We remark that the specific form of the
equations (1)-(3) is related to the transition probabilities for
a spin-flip in the dynamics. A second model, denoted by II, is
defined by considering only the two equations (1) and (2). When
$|h_3| > |h_1|$ and $|h_3| > |h_2|$ the relations (1)-(2) are
satisfied by two (out of the four possible) values of the output
$\zeta_0$; otherwise model II gives the same output as model I.
In other words, due to the presence of the $\eta_0^\mu$ and
$\xi_0^\mu$ in the gain functions, model II contains more
freedom and, strictly speaking, it is not a mapping.
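
The difference between the two models can be checked by brute force for a single pattern. The following is a minimal sketch, assuming the output conditions have the form xi0 = sign(h1 + eta0*h3), eta0 = sign(h2 + xi0*h3) and, for model I only, xi0*eta0 = sign(eta0*h1 + xi0*h2). The function names and the tie-breaking convention sign(0) = -1 are illustrative choices, not taken from the paper.

```python
from itertools import product


def sign(x):
    # Illustrative convention: a vanishing field is mapped to -1.
    return 1 if x > 0 else -1


def consistent_outputs(h1, h2, h3, model="II"):
    """List all output pairs (xi0, eta0) compatible with the local fields.

    Model II keeps the two conditions on xi0 and eta0; model I adds the
    third condition on the product xi0 * eta0.
    """
    solutions = []
    for xi0, eta0 in product((1, -1), repeat=2):
        ok = xi0 == sign(h1 + eta0 * h3) and eta0 == sign(h2 + xi0 * h3)
        if model == "I":
            ok = ok and xi0 * eta0 == sign(eta0 * h1 + xi0 * h2)
        if ok:
            solutions.append((xi0, eta0))
    return solutions


if __name__ == "__main__":
    # |h3| dominates: model II admits two outputs, model I selects one.
    print(consistent_outputs(0.2, 0.1, 1.0, model="II"))
    print(consistent_outputs(0.2, 0.1, 1.0, model="I"))
```

With a dominant third field the enumeration returns two admissible outputs for model II and a single one for model I, which is the extra freedom referred to in the text.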

The sequential dynamics of these two models has been studied in
the case of low loading with the Hebb rule and shown to lead to
the same equilibrium behaviour [12]. However, this is not
guaranteed here since we are concerned with optimal couplings
maximizing the loading capacity. At this point we remark that
when all $J_i^{(3)}$ are equal to zero we find back two
independent standard binary perceptron models. In the sequel we
take the couplings to satisfy the spherical constraint
$\sum_i (J_i^{(r)})^2 = N$.



III. REPLICA THEORY FOR THE MAXIMAL 
CAPACITY 



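The coloured perceptron is trained to store correctly
$p = \frac{3}{2}\alpha N$ patterns, with $\alpha$ the loading
capacity. The factor 3/2 follows naturally from the definition
of capacity given in the introduction. A pattern is stored
correctly when the so-called aligning field is bigger than a
certain constant $\kappa \geq 0$, whereby the latter indicates
the stability. It is a measure for the size of the basin of
attraction of that pattern. Specifically we require that

$$ \lambda_\xi^\mu(\{J\}) = \xi_0^\mu (h_1^\mu + \eta_0^\mu h_3^\mu) > \kappa_\xi \geq 0 \qquad (6) $$

$$ \lambda_\eta^\mu(\{J\}) = \eta_0^\mu (h_2^\mu + \xi_0^\mu h_3^\mu) > \kappa_\eta \geq 0 \qquad (7) $$

$$ \lambda_{\xi\eta}^\mu(\{J\}) = (\xi_0^\mu h_1^\mu + \eta_0^\mu h_2^\mu) > \kappa_{\xi\eta} \geq 0 \qquad (8) $$

with $\{J\} = \{J_i^{(r)}\}$ denoting the configurations in the
space of interactions. For
$\kappa_\xi = \kappa_\eta = \kappa_{\xi\eta} = 0$ all patterns
that satisfy equations (1)-(3) also satisfy (6)-(8). We remark
that for model II the last inequality is superfluous.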
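
As an illustration, the stabilities can be evaluated for a whole pattern set at once. This is a sketch under stated assumptions: the local fields are taken as h_r = J^(r) . psi_r / sqrt(N) with psi_1 = xi, psi_2 = eta, psi_3 = xi*eta, and the three stabilities as xi0*(h1 + eta0*h3), eta0*(h2 + xi0*h3) and xi0*h1 + eta0*h2. The array layout and all function names are hypothetical.

```python
import numpy as np


def local_fields(J, xi, eta):
    """Return the fields h_r^mu as an array of shape (3, p).

    J has shape (3, N); xi and eta have shape (p, N) with entries +1/-1.
    The 1/sqrt(N) normalization is an assumption of this sketch.
    """
    n_sites = J.shape[1]
    inputs = np.stack([xi, eta, xi * eta])  # shape (3, p, N)
    return np.einsum("rn,rpn->rp", J, inputs) / np.sqrt(n_sites)


def stabilities(J, xi, eta, xi0, eta0):
    """Return (lambda_xi, lambda_eta, lambda_xieta), each of shape (p,)."""
    h1, h2, h3 = local_fields(J, xi, eta)
    lam_xi = xi0 * (h1 + eta0 * h3)
    lam_eta = eta0 * (h2 + xi0 * h3)
    lam_xieta = xi0 * h1 + eta0 * h2
    return lam_xi, lam_eta, lam_xieta


def n_patterns(alpha, n_sites):
    """Number of patterns p = (3/2) * alpha * N, rounded to an integer."""
    return int(round(1.5 * alpha * n_sites))
```

For model II only the first two returned arrays enter the storage condition; for example, a loading alpha = 2 at N = 100 corresponds to p = 300 patterns.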



The aim is then to determine the maximal value of the loading
$\alpha$ for which couplings satisfying (6)-(8) can still be
found. In particular, the question whether this model can be
more efficient than the existing two-state models is relevant.

Following the Gardner-Derrida approach we formulate the problem
as an energy minimization in the space of couplings with the
formal energy function defined as

$$ E(\{J\}) = \sum_{\mu=1}^{p} \left[ 1 - \theta(\lambda_\xi^\mu(\{J\}) - \kappa_\xi)\, \theta(\lambda_\eta^\mu(\{J\}) - \kappa_\eta)\, \theta(\lambda_{\xi\eta}^\mu(\{J\}) - \kappa_{\xi\eta}) \right] \qquad (9) $$

We remark that for model II the third $\theta$-factor is absent.
The quantity above counts the number of weakly embedded
patterns, i.e., the patterns with stability less than
$\kappa_\xi$, $\kappa_\eta$, $\kappa_{\xi\eta}$. Therefore, the
minimal energy gives the minimal number of patterns that are
stored incorrectly. This number is zero below a maximal storage
capacity $\alpha_c$.
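
The counting performed by the energy function can be written compactly. The sketch below assumes that the energy is the number of patterns for which at least one required stability does not exceed its threshold, with a single common threshold kappa as used in the simulations; the function name and the convention of passing `None` for the third stability in model II are illustrative.

```python
import numpy as np


def energy(lam_xi, lam_eta, lam_xieta=None, kappa=0.0):
    """Count weakly embedded patterns.

    A pattern counts as stored when every required stability exceeds kappa.
    Passing lam_xieta=None drops the third condition (model II).
    """
    stored = (np.asarray(lam_xi) > kappa) & (np.asarray(lam_eta) > kappa)
    if lam_xieta is not None:  # model I: third theta-factor present
        stored &= np.asarray(lam_xieta) > kappa
    return int(np.sum(~stored))
```

A configuration of couplings stores the full pattern set exactly when this function returns zero.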

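The basic quantity to start from is the partition function

$$ Z(\beta) = \sum_{\{J\}} \exp[-\beta E(\{J\})] \qquad (10) $$

$$ \sum_{\{J\}} (\ldots) = \int \prod_{r,i} dJ_i^{(r)} \prod_r \delta\Big( \sum_i (J_i^{(r)})^2 - N \Big) (\ldots) \qquad (11) $$

with $\beta$ the inverse temperature. As usual it is $\ln Z$
which is assumed to be a self-averaging extensive quantity. The
related free energy per site

$$ f = - \lim_{N \to \infty} \frac{1}{\beta N} \ln Z(\beta) \qquad (12) $$

is equal, in the limit $\beta \to \infty$, to

$$ \lim_{N \to \infty} \frac{\sum_{\{J\}} E(\{J\}) \exp[-\beta E(\{J\})]}{N Z(\beta)} \qquad (13) $$

which is the minimal fraction of wrong patterns (recall
eq. (9)).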
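
The statement that the thermal average of the energy tends to the minimal energy at zero temperature can be illustrated on a finite list of energies. This is only a toy sketch: the list stands in for the integral over couplings, and the function name is hypothetical.

```python
import math


def boltzmann_mean_energy(energies, beta):
    """Thermal average of E with weights exp(-beta * E).

    The minimum is subtracted in the exponent for numerical stability;
    this leaves the average unchanged.
    """
    e_min = min(energies)
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return sum(e * w for e, w in zip(energies, weights)) / z


if __name__ == "__main__":
    toy_energies = [3, 1, 4, 1, 5]  # number of wrong patterns per configuration
    for beta in (0.0, 1.0, 50.0):
        print(beta, boltzmann_mean_energy(toy_energies, beta))
```

At beta = 0 all configurations are weighted equally, while for large beta only the configurations with the fewest wrong patterns contribute.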

In order to perform the average over the disorder in the input
patterns $\zeta^\mu$ and the corresponding outputs $\zeta_0^\mu$
we employ the replica method. The calculations proceed in a
standard way although the technical details are much more
complex. Introducing the order parameters

$$ q^{(r)}_{\gamma\tau} = \frac{1}{N} \sum_i J_i^{(r)\gamma} J_i^{(r)\tau}, \qquad r = 1,2,3, \quad \gamma, \tau = 1, \ldots, n, \qquad (14) $$

we write, following [11], the replicated partition function as
an integral over these order parameters and their conjugate
variables.

[The explicit expression, including the function $G_0$, is not
legible in the source.]

Here $\langle \ldots \rangle$ denotes the average over the
patterns, $r' = 1,2,3$ for model I and $r' = 1,2$ for model II.
Because of the latter we remark that for model II the formula
for $G_0$ can be simplified: the integrals over the auxiliary
variables belonging to the third stability condition are not
present, and these variables have to be set to zero. Because of
this simplification we only outline explicitly the calculations
for model II in the sequel. The corresponding formulas for
model I can be found in Appendix B.



A. Replica symmetric ansatz



We continue by making the replica-symmetric (RS) ansatz
$q^{(r)}_{\gamma\tau} = q^{(r)}$ for the order parameters and,
correspondingly, for their conjugate variables. Moreover, for
convenience, we set $q^{(1)} = q^{(2)} = q^{(3)} = q$. The
latter is justified for model I because of the symmetry present
in this model. Furthermore, since we are going to take all
$q^{(r)} \to 1$ in the Gardner-Derrida analysis anyway, we keep
this equality also for model II. Taking then the limits
$\beta \to \infty$, $N \to \infty$ and $n \to 0$ we arrive, in
the case of model II, at

[Eq. (15) is not legible in the source. It expresses the volume
$v$, obtained from the limit of $\langle \ln Z \rangle$, through
an integral of $\ln \psi_{RS}(\kappa_\xi, \kappa_\eta, s_1, s_2, q)$
over the measures $D(s_1(q/2))\, D(s_2(3q/2))$ plus terms
depending on $q$. The function $\psi_{RS}$, which carries an
index $\nu = 1,2$, and its integration limits, eqs. (16)-(19),
are not legible either.]

Here

$$ D(s(y)) = \frac{ds}{\sqrt{2\pi y}} \exp\left( -\frac{s^2}{2y} \right) $$

is a modified Gaussian measure, and $q$ takes those values that
minimize $v$, the available volume in the space of couplings.
For the corresponding expression in the case of model I we
refer to Appendix B.
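
The modified Gaussian measure is simply a centred Gaussian with variance y. A short numerical check, assuming the form D(s(y)) = ds exp(-s^2/(2y)) / sqrt(2*pi*y), confirms the normalization and the second moment. The grid parameters and function names are arbitrary choices made for this sketch.

```python
import numpy as np


def measure_weights(y, half_width=10.0, n_points=20001):
    """Grid points s and weights approximating D(s(y)) on a uniform grid."""
    sigma = np.sqrt(y)
    s = np.linspace(-half_width * sigma, half_width * sigma, n_points)
    ds = s[1] - s[0]
    weights = np.exp(-s**2 / (2.0 * y)) / np.sqrt(2.0 * np.pi * y) * ds
    return s, weights


def expectation(f, y):
    """Approximate the integral of f(s) over the measure D(s(y))."""
    s, weights = measure_weights(y)
    return float(np.sum(f(s) * weights))


if __name__ == "__main__":
    q = 0.5
    # The two measures of the RS expression have variances q/2 and 3q/2.
    print(expectation(lambda s: s**2, q / 2.0))
    print(expectation(lambda s: s**2, 3.0 * q / 2.0))
```

The same helper could be used to evaluate averages over $s_1$ and $s_2$ numerically once the integrand is known.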

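Taking $\kappa_\xi = \kappa_\eta = \kappa$ and supposing that
the maximal capacity, $\alpha_c = \alpha_{RS}$, is signaled by
the Gardner-like criterion $q \to 1$ we obtain

[The resulting expression for $\alpha_{RS}(\kappa)$ is not
legible in the source. It is a limit $q \to 1$ of an expression
containing $-\ln(1-q)$ and the integral of
$\ln \psi_{RS}(\kappa, \kappa, s_1, s_2, q)$ over
$D(s_1(q/2))\, D(s_2(3q/2))$.]

This maximal capacity as a function of $\kappa$ is shown for
both models in figs. 1 and 2 as a full line. For model I we
obtain, e.g., $\alpha_{RS}(\kappa = 0) = 1.92$, a value that is
smaller than the Gardner capacity for the simple perceptron.
For model II however, we get the interesting result that
$\alpha_{RS}(\kappa = 0) = 2.74 > 2$.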



B. First-step replica symmetry breaking

It is straightforward to show geometrically that learning almost
antiparallel patterns, i.e., patterns satisfying
$(\xi^\mu \xi_0^\mu, \eta^\mu \eta_0^\mu) \approx -(\xi^\nu \xi_0^\nu, \eta^\nu \eta_0^\nu)$,
results in a splitting of the space of couplings into
disconnected regions. This suggests that RS is broken and,
consequently, the results for $\alpha_{RS}$ found in Sec. IIIA
are only upper bounds for the true capacity. Therefore, we want
to improve the RS results by applying the first step of Parisi's
replica-symmetry-breaking (RSB) scheme. So, we assume that the
$q^{(r)}_{\gamma\tau}$ in equation (14) have the following
matrix block structure, eqs. (20)-(21):

$$ q^{(r)}_{\gamma\tau} = \begin{cases} q_1^{(r)} & \text{if } \gamma \text{ and } \tau \text{ lie in the same diagonal block} \\ q_0^{(r)} & \text{otherwise} \end{cases} $$

where $n$ is the size of the matrix $q^{(r)}_{\gamma\tau}$, $m$
is the number of diagonal blocks, and the block to which an
index belongs is determined through the integer part int(x) of
the suitably rescaled index.
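
The block structure of the first-step RSB ansatz is easy to visualize for a finite matrix. The sketch below builds such a matrix, assuming that two replica indices belong to the same block when their integer quotients by a block size coincide, and that the diagonal equals 1 as implied by a spherical normalization. The parameter `block_size` and the function name are choices of this sketch, not notation from the paper.

```python
import numpy as np


def rsb1_matrix(n, block_size, q0, q1):
    """Return an n x n overlap matrix with first-step RSB block structure.

    Entries inside a diagonal block equal q1, entries outside equal q0,
    and the diagonal is set to 1 (assumed normalization).
    """
    idx = np.arange(n)
    same_block = (idx[:, None] // block_size) == (idx[None, :] // block_size)
    q = np.where(same_block, q1, q0).astype(float)
    np.fill_diagonal(q, 1.0)
    return q


if __name__ == "__main__":
    # Six replicas arranged in three diagonal blocks of size two.
    print(rsb1_matrix(6, 2, q0=0.3, q1=0.8))
```

Setting q0 equal to q1 recovers the replica-symmetric matrix, which makes the ansatz a direct generalization of the RS one.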

For model II we take
$q^{(1)}_{\gamma\tau} = q^{(2)}_{\gamma\tau} \neq q^{(3)}_{\gamma\tau}$,
reflecting the symmetry of this model. For model I we repeat
that all $q^{(r)}_{\gamma\tau}$'s can be taken equal. We then
consider the limits $q_1^{(r)} \to 1$ and $m \to 0$ in such a
way that $m/(1 - q_1)$, with
$q_1^{(1)} = q_1^{(2)} = q_1^{(3)} = q_1$, remains finite. After
a tedious calculation we arrive at the following expression for
the RSB1 maximal capacity for model II



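[Eqs. (22) and (23) are not legible in the source. Eq. (22)
gives $\alpha_{RSB1}(\kappa)$ as a minimum of an expression
containing terms of the form $\ln(1+M)$ and
$1/[(1+M)(1-q_0^{(1)})]$, analogous terms involving
$q_0^{(3)}$, and the integral of $\ln \psi_{RSB1}$ over the
Gaussian measures $Dt_1\, Dt_2$.]

Here $Dt_i = dt_i \exp(-t_i^2/2)/\sqrt{2\pi}$ is a Gaussian
measure. The explicit form of the function
$\psi_{RSB1}(\kappa, t_1, t_2, q_0^{(1)}, q_0^{(3)}, M)$ can be
found in Appendix A. An analogous form for model I is written
down in Appendix B.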

The results are presented in figs. 1 and 2 as full lines. As
expected they lie below the RS results, confirming the breaking
of RS, e.g., $\alpha_{RSB1}(\kappa = 0) = 1.83$ for model I and
2.28 for model II. We remark that the breaking for model II is
stronger than for model I, the reason being that model II allows
more freedom as explained in Sec. II. Finally, on the basis of
results in the literature for the simple perceptron, we expect
that the RSB1 results are very close to the exact ones. This is
further examined by performing numerical simulations as
described in the following section.



IV. NUMERICAL SIMULATIONS 

The idea of these simulations is to train the network 
with a certain learning algorithm in order to learn as 
many random patterns as possible. The main technical 
difficulties are to find an efficient algorithm and prove its 
convergence. 

We have tried to generalize various algorithms proposed for
simple perceptrons. The most effective ones appeared to be some
particular generalization of the adaptive Gardner algorithm and
the Adatron algorithm. In the sequel we only report on the
results obtained with these two algorithms. We remark that we
have chosen $\kappa_\xi = \kappa_\eta = \kappa_{\xi\eta} = \kappa$
in all simulations.

One of the algorithms that has demonstrated its efficiency and
for which convergence has been shown in the case of the standard
perceptron is an adaptive version of the original algorithm
proposed by Gardner. Using the heuristic arguments presented for
that algorithm we have constructed for the coloured perceptron
model II the following analogous learning rule



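[Eqs. (24)-(26) are only partly legible in the source. The
couplings $J_i^{(1)}$ and $J_i^{(2)}$ are incremented by terms
proportional to
$\xi_0^\mu \xi_i^\mu (\kappa_\xi - \lambda_\xi^\mu)\, \theta(\kappa_\xi - \lambda_\xi^\mu)$
and
$\eta_0^\mu \eta_i^\mu (\kappa_\eta - \lambda_\eta^\mu)\, \theta(\kappa_\eta - \lambda_\eta^\mu)$
respectively, eqs. (24) and (25), and $J_i^{(3)}$ by a term
proportional to
$\xi_0^\mu \eta_0^\mu \xi_i^\mu \eta_i^\mu [(\kappa_\xi - \lambda_\xi^\mu)\, \theta(\kappa_\xi - \lambda_\xi^\mu) + (\kappa_\eta - \lambda_\eta^\mu)\, \theta(\kappa_\eta - \lambda_\eta^\mu)]$,
eq. (26). The prefactors are not recoverable.]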



The form of the algorithm for model I is a bit different and
given in Appendix B. This algorithm should be carried out
sequentially over the patterns and sequentially or in parallel
over the couplings as long as one of the arguments of the
$\theta$ functions is positive. It appears to have the
characteristics of the most efficient, non-linear algorithm
discussed in the literature.
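
A single learning step of this kind can be sketched as follows. This is not the authors' exact rule: it assumes fields normalized by 1/sqrt(N), increments of the Hebbian form weighted by (kappa - lambda) theta(kappa - lambda), and a hypothetical step size `rate`; the renormalization of the couplings to the spherical constraint is left out. All function names are illustrative.

```python
import numpy as np


def pattern_stabilities(J, xi, eta, xi0, eta0):
    """Stabilities (lambda_xi, lambda_eta) of one pattern for model II."""
    n_sites = xi.size
    h1 = J[0] @ xi / np.sqrt(n_sites)
    h2 = J[1] @ eta / np.sqrt(n_sites)
    h3 = J[2] @ (xi * eta) / np.sqrt(n_sites)
    return xi0 * (h1 + eta0 * h3), eta0 * (h2 + xi0 * h3)


def learning_step(J, xi, eta, xi0, eta0, kappa, rate=0.25):
    """Return updated couplings after presenting one pattern.

    Couplings change only if a stability lies below kappa; `rate` is an
    assumed step size, not a value taken from the paper.
    """
    n_sites = xi.size
    lam_xi, lam_eta = pattern_stabilities(J, xi, eta, xi0, eta0)
    c_xi = max(kappa - lam_xi, 0.0)    # (kappa - lambda) * theta(kappa - lambda)
    c_eta = max(kappa - lam_eta, 0.0)
    J_new = J.astype(float)            # astype returns a copy
    J_new[0] += rate * c_xi * xi0 * xi / np.sqrt(n_sites)
    J_new[1] += rate * c_eta * eta0 * eta / np.sqrt(n_sites)
    J_new[2] += rate * (c_xi + c_eta) * xi0 * eta0 * xi * eta / np.sqrt(n_sites)
    return J_new
```

With these conventions one step raises the stability of the presented pattern by rate * (2 c_xi + c_eta) for the first condition and rate * (2 c_eta + c_xi) for the second, so a pattern that is already stable leaves the couplings untouched.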

Using this learning rule we have trained networks of sizes
50 < N < 1000 sites (depending on the value of $\kappa$) in
order to store perfectly as many randomly chosen patterns as
possible. For each value of $\kappa$ we have calculated the
maximal capacity for different $N$ and extrapolated the results
to $N = \infty$. Results for a given value of $\kappa$ and $N$
are averages over 1000 samples. As shown in figs. 1 and 2 this
algorithm performs especially well for small values of $\kappa$
for both the models I and II.
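
The extrapolation to infinite system size can be sketched as a polynomial fit in 1/N. The text does not state which functional form was used, so the linear fit in 1/N below is an assumption, and the numbers in the usage example are synthetic placeholders rather than simulation results.

```python
import numpy as np


def extrapolate_to_infinite_size(sizes, capacities, degree=1):
    """Fit capacities as a polynomial in 1/N and return the value at 1/N = 0."""
    inv_n = 1.0 / np.asarray(sizes, dtype=float)
    coeffs = np.polyfit(inv_n, np.asarray(capacities, dtype=float), degree)
    return float(coeffs[-1])  # constant term of the fitted polynomial


if __name__ == "__main__":
    sizes = [50, 100, 200, 500, 1000]
    # Synthetic finite-size data, generated only to exercise the fit.
    capacities = [2.0 + 3.0 / n for n in sizes]
    print(extrapolate_to_infinite_size(sizes, capacities))
```

In practice the quality of such a fit should be checked against the statistical errors of the sample averages at each size.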

The second algorithm we report on is the Adatron algorithm,
which works in a different way. Instead of searching the maximal
capacity for a given stability it tries to find the maximal
stability for a given capacity. The derivation of this algorithm
and a proof of its convergence are based upon the assumption
that the problem can be formulated as a quadratic optimization
with linear constraints. Such a formulation cannot be given for
the coloured perceptron model, because the three different types
of couplings have to be normalized independently and because the
stability conditions (6)-(8) are more complex. Hence, a
straightforward generalization similar to the one for the Potts
model is not possible. Below we describe a learning rule that
tries to incorporate the ideas of the Adatron approach. We
assume that the couplings can be written in the form



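[Eq. (27), which expresses the couplings $J_i^{(r)}$ through the
patterns weighted by the embedding strengths, is not legible in
the source.]

where $x_r^\mu$ ($r = 1,2,3$) are the so-called embedding
strengths of pattern $\mu$. Then, in the case of model II the
couplings are updated by modifying $x_r^\mu$ with the following
increments

[Eqs. (28)-(30) are only partly legible in the source. The
increments $\delta x_1^\mu$ and $\delta x_2^\mu$ each contain
one term of the form
$\max\{\ldots, \gamma(1 - n_r \lambda^\mu)\}$, and
$\delta x_3^\mu$ is a sum of two such terms, one for each of the
stabilities $\lambda_\xi^\mu$ and $\lambda_\eta^\mu$.]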
This is done sequentially over the patterns. We remark that
again the algorithm for model I is somewhat different (see
Appendix B). For each value of the capacity we have considered
system sizes 50 < N < 500 and extrapolated the results to
$N = \infty$. The best results were obtained for a learning rate
$\gamma \in (0,2)$. Results for each size are averages over 1000
samples. For small values of the capacity the algorithm gives
better results, both in the case of models I and II, than the
first algorithm we have discussed, as shown in figs. 1 and 2.
For larger values of the capacity, however, it performs worse.
The results for the Adatron algorithm are displayed only in the
region where they are better than the results for the Gardner
algorithm. We remark that the numerical simulations with the
different algorithms give different results and that we have not
shown their convergence analytically, such that, in principle,
the values for $\alpha_c$ obtained here are lower bounds.

Looking at figs. 1 and 2 in more detail we see that for the
whole range of $\kappa$ the values of the maximal capacity in
model II are larger than those of a standard binary perceptron.
For $\kappa = 0$, e.g., the simulations give
$\alpha_c = 2.26 \pm 0.01$, which is bigger than the maximal
capacity of the binary perceptron model and the binary
many-neuron interaction model, both of which have
$\alpha_c = 2$. For model I the maximal capacity at
$\kappa = 0$ found by simulations is $1.78 \pm 0.01$.



V. CONCLUDING REMARKS

In this work we have calculated the maximal capacity per number
of couplings for two coloured perceptron models. Compared with
the standard perceptron these models have two neuronal variables
per site and a local field that contains higher order neuron
terms. The method used is a generalization of the Gardner
approach and both the RS and RSB1 results have been discussed.
We expect that the latter give very close upper bounds for the
exact values.

Extensive numerical simulations have been performed for finite
systems and extrapolated to $N = \infty$. The adaptive Gardner
algorithm and the Adatron algorithm give the best, but
different, results. Hence, the results of the simulations can be
considered only as lower bounds for the exact maximal capacity.
Additional work looking for improved algorithms would be
welcome.

Comparing the RSB1 results with the results from numerical
simulations we conclude that they are in good agreement. For
bigger values of $\kappa$ they even completely coincide. For
model I we find that at $\kappa = 0$ the maximal capacity
satisfies $1.78 < \alpha_c < 1.83$. This suggests that it is
equal to the maximal capacity of the $Q = 4$ Potts perceptron,
i.e., $\alpha_c = 1.83$ (after appropriate rescaling of the
latter). This would parallel the situation for Hebb learning
[13]. For model II we have for $\kappa = 0$ that
$2.26 < \alpha_c < 2.28$, which is larger than the maximal
capacity of the standard binary perceptron. This is due to the
fact that model II is not a strict mapping, such that it allows
for more freedom in the determination of the couplings.



The authors would like to thank M. Bouten and J. van 
Mourik for critical discussions. 



APPENDIX A: TECHNICAL DETAILS FOR 
MODEL II 



The function
$\psi_{RSB1}(\kappa, t_1, t_2, q_0^{(1)}, q_0^{(3)}, M)$ in
formula (23) reads

[The explicit expression is not legible in the source. It is a
sum of integrals over the Gaussian measure $Ds$ of combinations
of error functions, together with definitions of auxiliary
constants depending on $M$, $r$, $\kappa$, $t_1$, $t_2$ and the
order parameters $q_0^{(1)}$, $q_0^{(3)}$.]



APPENDIX B: FORMULA FOR MODEL I 



For model I the calculations are very similar. Some re- 
sulting expressions, however, have a somewhat different 
structure. For completeness we write down these expres- 
sions here. 

For the available space of couplings we get in the RS
approximation (compare (15))

[Eq. (31) is not legible in the source. It contains the integral
of
$\ln \psi_{RS}(\kappa_\xi, \kappa_\eta, \kappa_{\xi\eta}, s_1, s_2, s_3, q)$
over the measures $\prod_r D(s_r(q))$, together with terms in
$\alpha \ln 4$, $\ln(1-q)$ and $\ln 2\pi$.]

with $\psi_{RS}(\kappa_\xi, \kappa_\eta, \kappa_{\xi\eta}, s_1, s_2, s_3, q)$
given by eq. (32), a sum of nested integrals over $u_1$, $u_2$,
$u_3$ [integration limits not legible in the source], and $q$
taking those values that minimize $v$. Thus, for
$\kappa_\xi = \kappa_\eta = \kappa_{\xi\eta} = \kappa$ the
maximal capacity in the RS approximation can be written as

[The expression for $\alpha_{RS}(\kappa)$, a limit $q \to 1$
involving $-\ln(1-q)$, $\ln 2\pi$, $\ln 4$ and the integral of
$\ln \psi_{RS}(\kappa, s_1, s_2, s_3, q)$ over
$\prod_r D(s_r(q))$, is not legible in the source.]

For the RSB1 approximation with the form of the order parameters
given by (21) the maximal capacity reads

[The expression for $\alpha_{RSB1}(\kappa)$, containing terms of
the form $\ln(1+M)$ and $q_0 M / [(1+M)(1-q_0)]$ and the
integral of $\ln \psi_{RSB1}(\kappa, t_1, t_2, t_3, q_0, M)$
over $\prod_r Dt_r$, is not legible in the source.]

with $\psi_{RSB1}(\kappa, t_1, t_2, t_3, q_0, M)$ a linear
combination of thirty-four, mostly double, integrals over error
functions. An interested reader can find a complete formula for
$\psi_{RSB1}(\kappa, t_1, t_2, t_3, q_0, M)$ in [reference
missing in the source].

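Finally, the learning algorithms for model I differ in the way
that the couplings $J^{(1)}$ and $J^{(2)}$ are updated. We have
for the adaptive Gardner algorithm

[The update rules for $J_i^{(1)}$ and $J_i^{(2)}$ are not
legible in the source. Each contains, besides the term in
$(\kappa_\xi - \lambda_\xi^\mu)\, \theta(\kappa_\xi - \lambda_\xi^\mu)$
or
$(\kappa_\eta - \lambda_\eta^\mu)\, \theta(\kappa_\eta - \lambda_\eta^\mu)$,
a second term of the same form.]

instead of (24) and (25), and for the Adatron algorithm

[The increments $\delta x_1^\mu$ and $\delta x_2^\mu$ are not
legible in the source. Each is a sum of two terms of the form
$\max\{\ldots, \gamma(1 - n_r \lambda^\mu)\}$.]

instead of (28) and (29).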
FIG. 1. The maximal capacity of the coloured perceptron model I as a function of $\kappa$. Theoretical results, $\alpha_{RS}$ and $\alpha_{RSB1}$, 
versus simulations. The circles are the results for the adaptive Gardner algorithm, the diamonds for the Adatron algorithm. 
The error bars are smaller than the size of the symbols (not in the inset). The solid thin lines are polynomial fits to these 
results. The maximal capacity of a simple perceptron is indicated with a broken line. 